In this post, we'll derive the back-propagation equations for our convolutional net, which has the structure shown below. The square blocks refer to 2D matrices, while the rectangles represent matrices with a single row. They are drawn vertically, however, to make the diagram cleaner.
| matrix | size | meaning |
|--------|------|---------|
| \(\textbf{k}^i\) | \(1\times n^k\) | i\(^{th}\) kernel |
| \(\textbf{F}\) | \(n^k \times n^0\) | image frame captures |
| \(\textbf{l}^{ci}\) | \(1\times n^0\) | i\(^{th}\) pre-pooled layer |
| \(\textbf{l}^0\) | \(1\times n^0\) | pooled convolution layer |
| \(\textbf{w}^{01}\) | \(n^0 \times n^1\) | weights from layer 0 to 1 |
| \(\textbf{l}^1\) | \(1\times n^1\) | hidden layer |
| \(\textbf{w}^{12}\) | \(n^1 \times n^2\) | weights from layer 1 to 2 |
| \(\textbf{l}^2\) | \(1\times n^2\) | output layer |
In terms of components, the convolutional layer is given by
$$
l_i^{cj}=\sum_m k_m^j F_{mi},
$$
and
$$
l_k^0 = \tanh \left[ \max \left(l_k^{c0},l_k^{c1},...,l_k^{cN}\right) \right].
$$
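To make the shapes concrete, here is a minimal NumPy sketch of the convolution and the max-pool over kernels. The sizes `n_k`, `n_0` and the number of kernels, along with the random data standing in for the image frames, are illustrative choices, not values from this post.

```python
import numpy as np

# Illustrative sizes: kernel length n^k, layer-0 width n^0, and number of kernels.
n_k, n_0, n_kernels = 5, 20, 3

rng = np.random.default_rng(0)
F = rng.standard_normal((n_k, n_0))        # image frame captures, one frame per column
k = rng.standard_normal((n_kernels, n_k))  # row j is the kernel k^j

l_c = k @ F                      # l^{cj}_i = sum_m k^j_m F_{mi}, shape (n_kernels, n_0)
winner = np.argmax(l_c, axis=0)  # w(p): index of the winning kernel at each position p
l0 = np.tanh(l_c.max(axis=0))    # l^0_p = tanh(max_j l^{cj}_p), shape (n_0,)
```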
For layers 1 and 2 we have
$$
l_q^1=\tanh \left[\sum_r l_r^0 w_{rq}^{01}\right],
$$
and
$$
l_n^2 = \sigma \left[\sum_l l_l^1 w_{ln}^{12} \right],
$$
where \(\sigma\) is the softmax function. The loss function is
$$
\epsilon = \frac{1}{2}\sum_p \left(l_p^2-y_p \right)^2,
$$
and by applying the chain rule to these equations we will derive the back-propagation equations.
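Continuing the sketch from above, the hidden layer, the softmax output, and the loss follow directly from these equations; the layer widths and the one-hot target below are again placeholder choices.

```python
# Illustrative widths for the hidden and output layers.
n_1, n_2 = 10, 4
w01 = rng.standard_normal((n_0, n_1))
w12 = rng.standard_normal((n_1, n_2))

l1 = np.tanh(l0 @ w01)              # l^1_q = tanh(sum_r l^0_r w^{01}_{rq})
z2 = l1 @ w12
l2 = np.exp(z2) / np.exp(z2).sum()  # softmax output l^2

y = np.zeros(n_2)
y[1] = 1.0                          # an example one-hot target
eps = 0.5 * np.sum((l2 - y) ** 2)   # the quadratic loss
```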
As in the 3-layer non-convolutional net, we have
$$
\frac{\partial \epsilon}{\partial w_{qs}^{12}}=\left[\textbf{l}^1\otimes\left(\textbf{l}^2-\textbf{y}\right)\textbf{D}_{\sigma'}\right]_{qs}
$$
and
$$
\frac{\partial \epsilon}{\partial w_{qs}^{01}}=\left[\textbf{l}^0 \otimes \left(\textbf{l}^2-\textbf{y}\right)\textbf{D}_{\sigma'}\textbf{w}^{12,T}\textbf{D}_{t'}^1\right]_{qs}.
$$
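In code the diagonal matrices \(\textbf{D}_{\sigma'}\) and \(\textbf{D}_{t'}^1\) become elementwise multiplications. The sketch below reuses the variables from the forward-pass sketches and assumes the elementwise \(\sigma_m'\) is the diagonal of the softmax Jacobian, \(l_m^2(1-l_m^2)\); the intermediate names `delta2` and `delta1` are just for readability.

```python
sig_prime = l2 * (1.0 - l2)           # elementwise sigma'_m (assumed diagonal softmax derivative)
t1_prime = 1.0 - l1 ** 2              # tanh' at layer 1

delta2 = (l2 - y) * sig_prime         # (l^2 - y) D_{sigma'}
grad_w12 = np.outer(l1, delta2)       # l^1 (outer) delta2, shape (n_1, n_2)

delta1 = (delta2 @ w12.T) * t1_prime  # (l^2 - y) D_{sigma'} w^{12,T} D_{t'}^1
grad_w01 = np.outer(l0, delta1)       # l^0 (outer) delta1, shape (n_0, n_1)
```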
But now, in addition to these, we have
$$
\frac{\partial \epsilon}{\partial k_q^i}=\sum_{n,m,p}\frac{\partial \epsilon}{\partial l_m^2}\frac{\partial l_m^2}{\partial l_n^1}\frac{\partial l_n^1}{\partial l_p^0}\frac{\partial l_p^0}{\partial k_q^i}.
$$
These derivatives are
$$
\frac{\partial \epsilon}{\partial l_m^2}=\sum_p \left(l_p^2-y_p\right)\delta_{pm}=\left(l_m^2-y_m\right),
$$
$$
\frac{\partial l_m^2}{\partial l_n^1}=\sigma_m'\sum_l w_{lm}^{12}\delta_{ln}=\sigma_m' w_{nm}^{12},
$$
$$
\frac{\partial l_n^1}{\partial l_p^0}=t_n^{1'}\sum_r \delta_{pr}w_{rn}^{01}=t_n^{1'}w_{pn}^{01},
$$
and
$$
\frac{\partial l_p^0}{\partial k_q^i}=\frac{\partial}{\partial k_q^i}\left\{\tanh\left[\max\left(l_p^{c0},l_p^{c1},...,l_p^{cN}\right)\right]\right\}=t_p^{0'}\frac{\partial l_p^{c,w(p)}}{\partial k_q^i},
$$
where \(w(p)\) is the index of the largest \(l_p^{ci}\) ("w" is for "winner") and \(t_p^{0'}\) is the derivative of \(\tanh\) evaluated at that maximum. The remaining derivative is given by
$$
\frac{\partial l_p^{c,w(p)}}{\partial k_q^i}=\frac{\partial}{\partial k_q^i}\left[\sum_m k_m^{w(p)}F_{mp}\right]=\sum_m F_{mp}\delta^{i,w(p)}\delta_{mq}=F_{qp}\delta^{i,w(p)}.
$$
Defining
$$
\mathcal{F}_{qp}^i\equiv F_{qp}\delta^{i,w(p)},
$$
this result can be written more compactly as
$$
\frac{\partial l_p^{c,w(p)}}{\partial k_q^i}=\mathcal{F}_{qp}^i,
$$
so that
$$
\frac{\partial l_p^0}{\partial k_q^i}=t_p^{0'}\mathcal{F}_{qp}^i.
$$
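In code, \(\mathcal{F}^i\) is just \(\textbf{F}\) with every column zeroed out except at the positions where kernel \(i\) won the max, which can be built from the argmax recorded in the forward-pass sketch:

```python
# F^i_{qp} = F_{qp} delta^{i, w(p)}: keep column p of F only in the copy
# belonging to the kernel that won the max at position p.
F_masked = np.zeros((n_kernels, n_k, n_0))
for i in range(n_kernels):
    F_masked[i] = F * (winner == i)  # the 0/1 mask broadcasts over the n_k rows
```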
Hence, for the components of the gradient corresponding to the kernels, we obtain
$$
\frac{\partial \epsilon}{\partial k_{q}^i}=\sum_{n,m,p}\left(l_m^2-y_m\right)\sigma_m' w_{mn}^{12,T}t_n^{1'} w_{np}^{01,T}t_p^{0'}\mathcal{F}_{pq}^{i,T},
$$
which can be written as
$$
\frac{\partial \epsilon}{\partial k_{q}^i}=\left[\left(\textbf{l}^2-\textbf{y}\right)\textbf{D}_{\sigma'}\textbf{w}^{12,T}\textbf{D}_{t'}^1\textbf{w}^{01,T}\textbf{D}_{t'}^0 \mathcal{F}^{i,T} \right]_q.
$$
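Putting the pieces together, a sketch of the kernel gradients, reusing the variables defined above:

```python
# Propagate the error back to the pooled layer and contract with each F^i.
t0_prime = 1.0 - l0 ** 2              # tanh' at the pooled layer
delta0 = (delta1 @ w01.T) * t0_prime  # (l^2 - y) D_{sigma'} w^{12,T} D_{t'}^1 w^{01,T} D_{t'}^0

grad_k = np.zeros_like(k)
for i in range(n_kernels):
    grad_k[i] = delta0 @ F_masked[i].T  # [... F^{i,T}]_q, one gradient row per kernel
```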